15. Implementation
Implementation: Policy Improvement
In the last lesson, you learned that given an estimate Q of the action-value function q_\pi corresponding to a policy \pi, it is possible to construct an improved (or equivalent) policy \pi', where \pi'\geq\pi.
For each state s\in\mathcal{S}, you need only select the action that maximizes the action-value function estimate. In other words,
\pi'(s) = \arg\max_{a\in\mathcal{A}(s)}Q(s,a) for all s\in\mathcal{S}.
The full pseudocode for policy improvement can be found below.

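If it helps to see the idea in code before opening the notebook, here is a minimal sketch (not the notebook's required implementation, whose function signature may differ). It assumes the estimate Q is stored as a NumPy array of shape (nS, nA), where Q[s][a] approximates q_\pi(s,a), and it returns the greedy (deterministic) policy as an array of action probabilities.

```python
import numpy as np

def policy_improvement(Q):
    """Greedy policy improvement from an action-value estimate.

    Assumes Q is a NumPy array of shape (nS, nA), where Q[s][a]
    estimates q_pi(s, a). Returns a deterministic policy as an array
    of shape (nS, nA), with policy[s][a] = 1 for the greedy action.
    """
    nS, nA = Q.shape
    policy = np.zeros((nS, nA))
    # For each state, put all probability on an action that maximizes Q(s, .).
    # (np.argmax breaks ties by selecting the first maximizing action.)
    policy[np.arange(nS), np.argmax(Q, axis=1)] = 1
    return policy
```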
In the event that there is some state s\in\mathcal{S} for which \arg\max_{a\in\mathcal{A}(s)}Q(s,a) is not unique, there is some flexibility in how the improved policy \pi' is constructed.
In fact, as long as the policy \pi' satisfies for each s\in\mathcal{S} and a\in\mathcal{A}(s):
\pi'(a|s) = 0 if a \notin \arg\max_{a'\in\mathcal{A}(s)}Q(s,a'),
it is an improved policy. In other words, any policy that (for each state) assigns zero probability to the actions that do not maximize the action-value function estimate (for that state) is an improved policy. Feel free to play around with this in your implementation!
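As one illustration of this flexibility (again assuming Q is a NumPy array of shape (nS, nA)), the sketch below constructs a stochastic improved policy that spreads probability uniformly over all tied maximizing actions and assigns zero probability to every other action.

```python
import numpy as np

def stochastic_policy_improvement(Q):
    """One way to handle ties: spread probability uniformly over all
    actions that maximize Q(s, .), and give zero probability to the rest.

    Assumes Q is a NumPy array of shape (nS, nA).
    """
    nS, nA = Q.shape
    policy = np.zeros((nS, nA))
    for s in range(nS):
        # Find every action whose value is (numerically) equal to the maximum.
        best_actions = np.flatnonzero(np.isclose(Q[s], Q[s].max()))
        policy[s, best_actions] = 1.0 / len(best_actions)
    return policy
```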
Please use the next concept to complete Part 3: Policy Improvement of Dynamic_Programming.ipynb. Remember to save your work!
If you'd like to reference the pseudocode while working on the notebook, you are encouraged to open this sheet in a new window.
Feel free to check your solution by looking at the corresponding section in Dynamic_Programming_Solution.ipynb.